Abstract

The rank ordered distribution of the codon usage frequencies for 109 eubacteria and 14 archaea
is best fitted by a three-parameter function which is the sum of a constant, an exponential and a
linear term in the rank n. The parameters depend (two of them parabolically) on the total GC content.
The rank ordered distribution of the amino acids is fitted by a straight line. The Shannon entropy
computed over all the codons is well fitted by a parabola in the GC content, while the partial
entropies computed over subsets of the codons show markedly different behaviours. Moreover, the sum
of the codon usage frequencies over particular sets, e.g. with C and A (respectively G and U)
as i-th nucleotide, shows a clear linear dependence on the GC content, exhibiting a conspiracy
effect.
1 Introduction



The genetic code is degenerate, referring to the fact that almost all the amino acids are encoded by
several 'synonymous' codons, and this degeneracy is primarily found in the third position of the codon.
Some codons are used much more frequently than others to encode a particular amino acid, the pattern of
codon usage varying between species. In the last few years, the number of available data for coding
sequences has considerably increased PP, allowing for analysis to look for regularities, correlations and
general features over the whole exonic region. Recently, from an analysis of the rank distribution for
codons in RNA coding sequences ^1, performed in many genes for several biological species, the existence
of a universal, i.e. biological species independent, distribution law for codons for the eukaryotic code
has been remarked [2]. Indeed, it was pointed out that the rank of codon usage probabilities follows
a universal law, the frequency function of the rank ordered codons being very nicely fitted by a sum
of an exponential, a linear part and a constant. Such a universal behaviour suggested the presence of
general biases, one of which was identified with the total exonic GC content (denoted throughout the
paper by $y_{GC}$, $0 \le y_{GC} \le 1$), which is well known to play a strong role in the evolutionary
process. In fact, the values of the parameters appearing in the fitting expression, plotted versus the
total exonic GC content of the biological species, were reasonably well fitted by a parabolic function
of the GC content. It is worthwhile to recall that determining the kind of law the codon rank
distribution follows is extremely interesting for investigating the nature of the evolutionary process
which has acted upon the codon distribution, i.e. the possible presence of a bias.

Possible origins of codon bias have been recognized either as the result of natural selection or
as the result of mutational pressure acting on the whole genome, also known as neutral evolution
theory [Sill]. Experimental evidence has now shown that both effects should occur (see e.g. for
the influence of the biased mutation pressure in the perspective of the neutral theory of molecular
evolution, [H] for exhibiting the role of external selective forces in the case of thermophilic bacteria, [7]
for an analysis of genome-wide codon bias of bacterial species, jH] for a discussion on the relative role
of mutational bias and translational selection for codon usage in many genomes, including bacterial
ones, jH] for a model explaining trends in codon and amino-acid usage vs. GC content). A quantitative
model of directional mutational pressure has been proposed to measure to some extent the relative
role of selection constraints and neutrality ^1 . Many efforts have also been made to study codon
bias among genes of a given genome, in particular the influence of the gene expression level jl2|lllS| [n],
of the tRNA abundance [15, 16], of translational accuracy [17], see also [5] and references therein.

It seemed to us interesting to continue our previous analysis on a 'trend across species' basis by
focusing on the bacteria, one main interest of the present study arising from the wide variation of the
total GC content, ranging from 25% to 75%, whereas the GC variation inside a bacterial genome is
much smaller. Our result is that the rank ordered distribution of the codon usage frequencies for
bacteria is best fitted by a three-parameter function, which is the sum of a constant, an exponential
and a linear term in the rank n. Two of the three parameters depend parabolically on the total exonic
GC content. As the sum of the codon usage frequencies over suitable sets, defined below, is well fitted
by a straight line in the GC content, we say that a conspiracy effect has to be present. Moreover, we
conjecture the existence of an averaged discrete symmetry in the codon usage frequencies, reasonably
confirmed by the data. We also calculate the rank ordered distribution of the 20 amino acids, which
is satisfactorily fitted by a straight line in the rank n.

^1 DNA is constituted by four bases, adenine (A), cytosine (C), guanine (G) and thymine (T), the
last one being replaced by uracil (U) in messenger RNA. A codon is defined as an ordered sequence of three
bases. Coding sequences in RNA are characterized by their constituent codons.

We compute the Shannon entropy (as defined in jJH]) and find that its behaviour as a function
of the exonic GC content is a parabola, whose apex is around the value 0.50 of the GC content,
as expected for the behaviour of the Shannon entropy for two variables. Moreover, the Shannon
entropies for the codons whose orders in rank are, respectively, in the ranges 1-15, 16-25 and 26-61
have peculiar features, which we comment on below.

The study of this paper was based on a sample of 109 bacteria and 14 archaea, with a codon 
statistics larger than 300 000. The data were taken from Codon Usage Tabulated from GenBank ^ 
(see also http://www.kazusa.or.jp/codon/), release 138 for eubacteria and mainly release 144 for 
archaea. 



2 Codon usage probabilities distribution 

Let us define the usage probability for the codon XZN ($X, Z, N \in \{A, C, G, U\}$)

$P(XZN) = \lim_{N_{tot} \to \infty} \frac{n_{XZN}}{N_{tot}}$ (1)

where $n_{XZN}$ is the number of times the codon XZN has been used in all considered processes, for a
given biological species, and $N_{tot}$ is the total number of codons used in the same processes. It follows
that our analysis and predictions hold for biological species with sufficiently large statistics of codons.
For each biological species, codons are ordered following the decreasing order of the values of their
usage probabilities, i.e. codon number 1 corresponds to the highest value, codon number 2 is the next
highest, and so on. We denote by $f(n)$ the probability $P(XZN)$ of finding XZN in the n-th position.
Of course the same codon occupies in general two different positions in the rank distribution function
for two different species. By plotting $f(n)$ versus the rank we confirm that the data are best fitted by
the kind of function we found in [2], i.e. the sum of an exponential function, a linear function and a
constant:

$f(n) = \alpha e^{-\eta n} - \beta n + \gamma.$ (2)

In the following analysis we shall not consider the three stop codons, whose function is very peculiar.
Therefore the four parameters $\alpha$, $\eta$, $\beta$ and $\gamma$ are constrained by the normalisation condition

$\sum_{n=1}^{61} f(n) = \alpha \, \frac{e^{-\eta} (1 - e^{-61 \eta})}{1 - e^{-\eta}} - 1891 \, \beta + 61 \, \gamma$ (3)

where the left-hand side, a number not larger than 1, is computed for any biological species from the
codon usage frequencies of the 61 encoding codons (note that the result is almost unchanged if the data
are normalized to one summing over the 64 coding codons). ^2
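As a quick numerical check of the fitting function (2) and of the closed form of its sum over the 61 coding codons, one can compare a direct sum with the geometric-series expression. The sketch below is only illustrative: the parameter values are hypothetical placeholders chosen for the example, not fitted values from the paper, and the function names are ours.

```python
import math

def rank_frequency(n, alpha, eta, beta, gamma):
    """Eq. (2): f(n) = alpha * exp(-eta * n) - beta * n + gamma."""
    return alpha * math.exp(-eta * n) - beta * n + gamma

def normalisation_closed_form(alpha, eta, beta, gamma, n_codons=61):
    """Sum of eq. (2) over n = 1..n_codons, using the geometric series."""
    geometric = math.exp(-eta) * (1.0 - math.exp(-eta * n_codons)) / (1.0 - math.exp(-eta))
    linear = n_codons * (n_codons + 1) // 2  # equals 1891 for 61 codons
    return alpha * geometric - beta * linear + gamma * n_codons

# Hypothetical parameter values, used only to exercise the formulas.
params = dict(alpha=0.03, eta=0.3, beta=4.0e-4, gamma=0.028)
direct = sum(rank_frequency(n, **params) for n in range(1, 62))
closed = normalisation_closed_form(**params)
print(direct, closed)
```

The coefficient 1891 of the linear term is simply 61 x 62 / 2, the sum of the ranks from 1 to 61.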

^2 The procedure to fit f(n) in the present paper is slightly different from the one followed in [2]. In that
paper the value of the parameter $\gamma$ was fixed to the value corresponding to a uniform distribution, i.e.
$\gamma = 1/61 \simeq 0.0164$. Therefore we were left with only two free parameters, while in the present paper
we use three free parameters. While the general features of the fit are unchanged, the present fits are more
accurate, with a $\chi^2$ lower by about two orders of magnitude.

Figure 1: Codon rank distribution f(n) for Borrelia burgdorferi

In the following, the parameters for the different fits have been computed using a best-fit procedure,
the curve fit being based on the Levenberg-Marquardt algorithm. The goodness of a fit is measured by the
$\chi^2$ coefficient and by the Pearson's R coefficient, the latter being defined by

$R = \frac{\sum_i (y_i - \bar{y}) (y'_i - \bar{y}')}{\sqrt{\sum_i (y_i - \bar{y})^2 \, \sum_i (y'_i - \bar{y}')^2}}$ (5)

where $y_i$ are the actual values, $y'_i$ are the calculated ones, $\bar{y}$ and $\bar{y}'$ are the means of the
actual values and of the calculated ones respectively. Recall that $R^2$ represents the ratio of the explained
variance to the total variance. Fits can be considered as satisfactory for values of $R^2 > 0.8$.
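The Pearson coefficient between actual and calculated values can be computed directly from its definition. The following is a minimal sketch in plain Python; the two short lists are made-up numbers used only to exercise the function, not data from the paper.

```python
import math

def pearson_r(actual, calculated):
    """Pearson correlation between actual values y_i and calculated values y'_i."""
    mean_a = sum(actual) / len(actual)
    mean_c = sum(calculated) / len(calculated)
    covariance = sum((a - mean_a) * (c - mean_c) for a, c in zip(actual, calculated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    return covariance / math.sqrt(var_a * var_c)

# Illustrative values only.
actual = [1.0, 2.0, 3.0, 4.0]
calculated = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(actual, calculated)
print(r, r ** 2)
```

With the threshold quoted in the text, a fit would be called satisfactory when `r ** 2` exceeds 0.8.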

In figs. 1-4 we report the plot of f(n) as a function of n for a few bacteria: Borrelia burgdorferi
(GC = 28.8 %), Bacillus subtilis (GC = 44.4 %), Escherichia coli (GC = 51.8 %), and Ralstonia
solanacearum (GC = 67.5 %). One remarks that the codon rank distributions show mainly a nearly
uniform or slowly decreasing behaviour for most of the codons, and a peak for a small number of
highly 'overrepresented' codons (codon bias contribution). ^3

Note that the fits of the rank ordered distribution f(n) by a Yule law or a Zipf law are unsatisfactory.
Indeed, as has been emphasized elsewhere, when the majority of points resides in the tail of the
distribution, it is necessary to fit the whole range of data. From the previous discussion, we expect the
parameters to depend on the total GC content of the gene regions (here the total exonic GC content). We
have investigated this dependence and we report in figs. 5-8 the plots of the parameters $\alpha$, $\beta$,
$\gamma$ and $\eta$ versus the total exonic GC content. In terms of the total exonic GC content $y_{GC}$ of the
biological species, one finds that the values of $\alpha$ and $\gamma$ are well fitted by polynomial functions:

^3 Statistical tables specifying for each biological species the codon statistics, the total GC content, the
values of the parameters $\alpha$, $\beta$, $\eta$, computed by a best-fit procedure, the estimated errors
$\Delta\alpha$, $\Delta\beta$, $\Delta\eta$ on these parameters, and the corresponding estimators of goodness of the
fit $\chi^2$ and $R^2$ can be found on the archives at the following address:
http://xxx.lanl.gov/abs/q-bio.GN/0507030v1.

Figure 4: Codon rank distribution f(n) for Ralstonia solanacearum

$\alpha = \alpha_0 + \alpha_1 \, y_{GC} + \alpha_2 \, y_{GC}^2$ (6)

where

$\alpha_0 = 0.250 \pm 0.017$, $\alpha_1 = -0.919 \pm 0.071$, $\alpha_2 = 0.939 \pm 0.072$, (7)

the goodness of the fit being given by

$\chi^2 = 0.013$ and $R = 0.768$, (8)

and

$\gamma = \gamma_0 + \gamma_1 \, y_{GC} + \gamma_2 \, y_{GC}^2$ (9)

where

$\gamma_0 = -0.0376 \pm 0.0057$, $\gamma_1 = 0.268 \pm 0.024$, $\gamma_2 = -0.275 \pm 0.024$, (10)

the goodness of the fit being given by

$\chi^2 = 0.0015$ and $R = 0.728$. (11)

The $\eta$ parameter shows a '$\Lambda$-like' behaviour in terms of $y_{GC}$, centered on the value $y_{GC} = 0.50$,
while the values of the $\beta$ parameter are mainly in the range $3 \times 10^{-4} < \beta < 5 \times 10^{-4}$. Note
that the errors on $\beta$ become large when $y_{GC}$ is far from the mean value 0.50.
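The position of the extremum of each fitted parabola follows from its coefficients as $-c_1 / (2 c_2)$. The short sketch below evaluates it for the central values quoted in eqs. (7) and (10), ignoring the quoted uncertainties; the helper names are ours.

```python
def parabola(y, c0, c1, c2):
    """Quadratic fit c0 + c1 * y + c2 * y**2 in the GC content y."""
    return c0 + c1 * y + c2 * y * y

def vertex(c1, c2):
    """GC content at which the parabola has its extremum."""
    return -c1 / (2.0 * c2)

# Central values of the fitted coefficients, eqs. (7) and (10).
ALPHA = (0.250, -0.919, 0.939)
GAMMA = (-0.0376, 0.268, -0.275)

for name, coeffs in (("alpha", ALPHA), ("gamma", GAMMA)):
    y_star = vertex(coeffs[1], coeffs[2])
    print(name, round(y_star, 3), round(parabola(y_star, *coeffs), 4))
```

Both extrema fall close to a GC content of 0.49, a minimum for $\alpha$ (positive curvature) and a maximum for $\gamma$ (negative curvature).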

Of course we are not able to predict which codon occupies the n-th rank. Finally, let us remark
that the total exonic GC content $y_{GC}$ has to satisfy the consistency condition

$y_{GC} = \frac{1}{3} \sum_{i \in J} d_i \, f(i)$, (12)

where the sum is over the set J of integers to which the 56 codons containing G and/or C nucleotides
belong and $d_i$ is the multiplicity of these nucleotides inside the i-th codon.
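The consistency condition can be illustrated by computing the GC content directly from a table of codon frequencies. The sketch below indexes the frequencies by codon rather than by rank, which gives the same sum; the uniform table is a toy input used as an assumption for the example, not real codon usage data.

```python
from itertools import product

BASES = "ACGU"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

def gc_multiplicity(codon):
    """Number of G or C nucleotides in the codon (the multiplicity d_i)."""
    return sum(base in "GC" for base in codon)

def gc_content(freqs):
    """GC content as one third of the multiplicity-weighted sum of frequencies."""
    return sum(gc_multiplicity(c) * p for c, p in freqs.items()) / 3.0

# Toy input: uniform usage over the 64 codons.
uniform = {c: 1.0 / 64 for c in CODONS}
n_gc_codons = sum(gc_multiplicity(c) > 0 for c in CODONS)
print(n_gc_codons, gc_content(uniform))
```

The count of codons containing at least one G or C comes out as 56, since only the 8 codons built from A and U alone are excluded.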

3 Amino-acid rank distribution 

It is natural to wonder if some kind of universality is also present in the rank distribution of amino
acids. From the available data for codon usage, we can immediately compute (using the bacterial
code) the frequency of appearance of any amino acid $F(n)$ ($1 \le n \le 20$) in the whole set of coding
sequences. The calculated values as a function of the rank are satisfactorily fitted by a straight line,

$F(n) = F - B n$. (13)
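The passage from codon frequencies to a ranked amino-acid distribution, followed by a straight-line fit, can be sketched as follows. This is an illustration under stated assumptions: it uses the standard codon-to-amino-acid assignments, an ordinary least-squares fit (the paper does not specify the procedure used for the line), and a uniform toy codon table instead of real data.

```python
from itertools import product

BASES = "UCAG"
# Standard codon assignments in UCAG order; '*' marks the stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def amino_acid_rank_distribution(codon_freqs):
    """Sum codon frequencies per amino acid, renormalise, sort in decreasing order."""
    totals = {}
    for codon, p in codon_freqs.items():
        aa = CODE[codon]
        if aa != "*":
            totals[aa] = totals.get(aa, 0.0) + p
    norm = sum(totals.values())
    return sorted((p / norm for p in totals.values()), reverse=True)

def fit_line(values):
    """Least-squares fit of values[n-1] to F - B * n for n = 1..len(values)."""
    ranks = range(1, len(values) + 1)
    n_mean = sum(ranks) / len(values)
    v_mean = sum(values) / len(values)
    slope = (sum((n - n_mean) * (v - v_mean) for n, v in zip(ranks, values))
             / sum((n - n_mean) ** 2 for n in ranks))
    B = -slope
    F = v_mean + B * n_mean
    return F, B

# Toy input: uniform usage over all codons.
uniform = {codon: 1.0 / 64 for codon in CODE}
ranked = amino_acid_rank_distribution(uniform)
F, B = fit_line(ranked)
print(len(ranked), F, B, 20 * F - 210 * B)
```

Because a least-squares line passes through the mean point and the 20 frequencies sum to one, the fitted parameters satisfy $20 F - 210 B = 1$, so F and B are not independent.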

In terms of the total exonic GC content, the parameters B and F for eubacteria show a parabolic
dependence: $B = B_0 + B_1 \, y_{GC} + B_2 \, y_{GC}^2$ where

$B_0 = 0.00915 \pm 0.00034$, $B_1 = -0.0232 \pm 0.0014$, $B_2 = 0.0258 \pm 0.0014$, (14)

the goodness of the fit being given by

$\chi^2 = 4.4 \times 10^{-?}$ and $R = 0.911$, (15)

and $F = F_0 + F_1 \, y_{GC} + F_2 \, y_{GC}^2$ where

$F_0 = 0.145 \pm 0.004$, $F_1 = -0.243 \pm 0.015$, $F_2 = 0.270 \pm 0.015$, (16)

the goodness of the fit being given by

$\chi^2 = 48 \times 10^{-?}$ and $R = 0.911$. (17)

One remarks that the most frequent amino acid is always above the line. This can be easily understood
in the light of eq. (2). Indeed, the most frequent amino acids get, in general, a contribution from the
exponential term of (2) with a low value of n.

Of course, the frequency of an amino acid is correlated to the frequencies of its encoding codons given
by (2). If the ranks of the encoding codons were completely random, we would not expect their sums
to take equally spaced values, as is the case for a regression line. Therefore, we can infer, for the
biological species whose amino-acid frequencies are very well fitted by a line, the existence of some
functional constraints (conspiracy effect) on the codon usage. An analysis of the influence of the total
GC content on the amino-acid composition of the proteins of 59 bacteria has been reported in [21],
where references to previous works on the subject can be found, see also ^1^^. The figures reported
in [2^ qualitatively agree with the linear behaviour given by eq. (13) and show appreciable quantitative
differences with the expected frequencies, computed on the basis of the neutral model, where the
frequency of an amino acid is the sum of the synonymous codon frequencies, the frequency of a codon
being computed as the product of the probabilities of the three nucleotides (mean field model). This
analysis already hints in the direction of the presence of functional constraints.

However, the behaviour predicted by (2) fits the experimental data very well, while the shape of
the distribution of amino acids seems more sensitive to the biological species. In fact, one can remark
on many plots of the amino-acid distributions the existence of one or two plateaux, which obviously
indicate equal probabilities of use for some amino acids. At present, we have no argument
to explain this uniform distribution of some amino acids from the ranked distribution of the corresponding
codons, nor do we know of any explanation of this pattern of distribution.

Figure 9: Parameter B vs. GC content (eubacteria)

Figure 10: Parameter F vs. GC content (eubacteria)
4 The Shannon entropy



We compute the Shannon entropy,

$S = - \sum_{n=1}^{61} f(n) \log_2 f(n)$, (18)

related to the codon rank distribution and plot it versus the total exonic GC content $y_{GC}$ for the sample
of 109 eubacteria and 14 archaea, see fig. 11. The contribution of the stop codons is negligible and
represents less than 0.5% of the total entropy. The Shannon entropy is very well fitted by a parabola,

$S = s_0 + s_1 \, y_{GC} + s_2 \, y_{GC}^2$ (19)

where

$s_0 = 2.670 \pm 0.084$, $s_1 = 12.36 \pm 0.35$, $s_2 = -12.73 \pm 0.36$, (20)

the goodness of the fit being given by

$\chi^2 = 0.329$ and $R = 0.916$. (21)

Note that the parabola has its apex for $y_{GC} \simeq 0.50$, which is expected for the behaviour of the Shannon
entropy for two variables (here GC and AU). We have computed the partial Shannon entropies for the
codons whose orders in rank are, respectively, in the ranges 1-15, 16-25 and 26-61 and we report them
in fig. 11 versus the GC content. In each of the three sets, we have put the codons whose contributions
to the entropy have a similar behaviour with respect to the GC content. Indeed, the codons of the first set
are primarily influenced by the exponential term in the rank distribution (2), leading for the partial
entropy $S_{1-15}$ to a parabolic behaviour with positive curvature and minimum at 50% GC content. For
the codons of the intermediate set, the exponential term and the last two terms are of the same order
of magnitude, hence the partial entropy $S_{16-25}$ is almost uniform with respect to the GC content. For
the codons of the last set, the exponential term is completely negligible, and since f(n) is very small,
$-f(n) \log_2 f(n) \sim f(n) / \ln 2$. The trend of the partial entropy $S_{26-61}$ is thus essentially given by the
behaviour of the $\gamma$ parameter, hence the parabolic shape with negative curvature and maximum at
50% GC content.
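The total and partial entropies can be computed from a rank ordered list of frequencies as in the following sketch. It assumes that a partial entropy is the plain sum of the terms $-f(n) \log_2 f(n)$ over the stated rank range, without renormalisation, and it uses a uniform toy distribution rather than real codon usage.

```python
import math

def shannon_entropy(freqs):
    """Sum of -p * log2(p) over the given frequencies (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

def partial_entropies(ranked, ranges=((1, 15), (16, 25), (26, 61))):
    """Entropy contributions of the codons whose ranks fall in each range."""
    return {f"{lo}-{hi}": shannon_entropy(ranked[lo - 1:hi]) for lo, hi in ranges}

def binary_entropy(y):
    """Entropy of a two-variable distribution (y, 1 - y), maximal at y = 0.5."""
    return shannon_entropy([y, 1.0 - y])

# Toy input: uniform distribution over the 61 coding codons.
uniform = [1.0 / 61] * 61
total = shannon_entropy(uniform)
parts = partial_entropies(uniform)
print(total, math.log2(61), parts)
print(binary_entropy(0.5), binary_entropy(0.3))
```

For the uniform case the total equals $\log_2 61$, and the three partial entropies add up to the total because the rank ranges cover all 61 codons.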

The fact that the Shannon entropy is a parabola shows obviously that the codon distribution is
not uniform: the uniform distribution corresponds to the maximal entropy $S = \log_2 N$, independent
of the GC content, where N is the number of considered codons (here N = 61). The concavity of the
entropy is also well understood in this context in terms of deviations from the uniform distribution when
$y_{GC}$ is far from the mean value 50%. On the contrary, the partial entropy $S_{1-15}$, which is associated
to the 'overrepresented' codons, exhibits a convex shape. This feature can also be understood by
realizing that in the case of $y_{GC}$ far from 50%, the distribution of the overrepresented codons would be
closer to the expected value for a non-biased distribution than in the case of $y_{GC} \simeq 50\%$.

5 The mystery of the straight lines

In recent papers [22, 23] it has been remarked, for 129 eubacteria and for 19 archaea, that the
codon-specific nucleotide frequency $P^i_X$, i.e. the sum of the codon usage for any given nucleotide X in any
fixed i-th position, that is, for example

$P^1_X = \sum_{Y,Z} P(XYZ)$, (22)

is fitted, versus the GC content, by a straight line, with coefficients depending on the position and on
the nature of the nucleotide and slightly different for eubacteria and archaebacteria (see Table 1 for
eubacteria and Table 2 for archaea for the values of the slope and the axis intercept in the case of our
sample of bacteria). ^4 There are 12 codon-specific nucleotide frequencies, which are constrained by
three normalisation conditions

$P^i_C + P^i_U + P^i_G + P^i_A = 1$, $i = 1, 2, 3$. (23)

The fact that the sum of the codon position-specific nucleotide frequencies is a linear function of GC
is a mysterious feature, which becomes more mysterious in the light of the results of Sec. 2, where we
have remarked that the parameters appearing in the rank ordered frequency (2) are parabolic functions
of GC. A conspiracy has to be present between the f(n), see eq. (2), for any bacteria, such that the
sum of f(n) over the 16-dimensional sets Q, depending on the considered bacteria and corresponding
to the codon position-specific frequencies, produces a linear function of GC, that is

$\sum_{n \in Q} f(n) = a \, y_{GC} + b$. (24)

Figure 11: Shannon entropy and partial entropies vs. GC content: (1) total entropy, (2) partial
entropy rank 1-15, (3) partial entropy rank 16-25, (4) partial entropy rank 26-61

Things become still more mysterious taking into account what we have remarked, guided by the
mathematical structure of the crystal basis model of the genetic code [2H. Let us recall that in this
model the four nucleotides are assigned to the 4-dim fundamental irreducible representation (1/2, 1/2)
of $U_{q \to 0}(so(4)) = U_{q \to 0}(sl(2) \oplus sl(2)) = U_{q \to 0}(sl_H(2) \oplus sl_V(2))$ (the lower labels H and V
denote the two commuting sl(2)), with the state assignment (in the following the first number denotes the
value of the label J specifying the irreducible representation (irrep.) of sl(2) and the second one the value
of the label $J_3$ denoting the state in the irrep., $J_3 = J, J-1, \ldots, -J$, $2J \in Z_+$)

$C = (1/2, 1/2)_H, (1/2, 1/2)_V$

$U = (1/2, -1/2)_H, (1/2, 1/2)_V$

$G = (1/2, 1/2)_H, (1/2, -1/2)_V$

$A = (1/2, -1/2)_H, (1/2, -1/2)_V$ (25)

which, in matrix notation, can be written as

$\begin{pmatrix} C & U \\ G & A \end{pmatrix}$

The codons are the composite states of the 3-fold tensor product of the fundamental irreducible
representation, that is

$\begin{pmatrix} C & U \\ G & A \end{pmatrix}^{\otimes 3}$

which can be written in matrix form as

$\begin{pmatrix}
CCC & CCU & CUC & CUU & UCC & UCU & UUC & UUU \\
CCG & CCA & CUG & CUA & UCG & UCA & UUG & UUA \\
CGC & CGU & CAC & CAU & UGC & UGU & UAC & UAU \\
CGG & CGA & CAG & CAA & UGG & UGA & UAG & UAA \\
GCC & GCU & GUC & GUU & ACC & ACU & AUC & AUU \\
GCG & GCA & GUG & GUA & ACG & ACA & AUG & AUA \\
GGC & GGU & GAC & GAU & AGC & AGU & AAC & AAU \\
GGG & GGA & GAG & GAA & AGG & AGA & AAG & AAA
\end{pmatrix}$ (26)
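The 8x8 codon matrix can be generated mechanically as the 3-fold tensor (Kronecker-like) product of the 2x2 nucleotide matrix, with string concatenation playing the role of multiplication. The sketch below is one possible construction, using the bits of the row and column indices to select the entry of each factor; the function and variable names are ours.

```python
BASE = [["C", "U"], ["G", "A"]]
MIRROR = {"C": "A", "A": "C", "G": "U", "U": "G"}

def codon_matrix():
    """8x8 matrix whose (row, col) entry is the codon selected by the index bits."""
    rows = []
    for row in range(8):
        r = [(row >> 2) & 1, (row >> 1) & 1, row & 1]
        line = []
        for col in range(8):
            c = [(col >> 2) & 1, (col >> 1) & 1, col & 1]
            line.append("".join(BASE[r[k]][c[k]] for k in range(3)))
        rows.append(line)
    return rows

def mirror(codon):
    """Mirror codon under the exchange C <-> A, G <-> U."""
    return "".join(MIRROR[base] for base in codon)

matrix = codon_matrix()
for line in matrix:
    print(" ".join(line))

# Inversion through the centre of the matrix exchanges C <-> A and G <-> U.
centre_ok = all(matrix[7 - i][7 - j] == mirror(matrix[i][j])
                for i in range(8) for j in range(8))
print(centre_ok)
```

The first nucleotide selects one of the four 4x4 blocks, the second a 2x2 block inside it, and the third the entry inside that block.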



^4 Note that the difference of behaviour between eubacteria and archaea could stem from the fact that
eubacteria (at least in our sample) are mainly mesophilic while archaea are essentially thermophilic.
Eq. (23) can be rewritten, for instance for i = 1, as

$P^1_C + P^1_A - 1/2 = - P^1_G - P^1_U + 1/2$ (27)

and, looking at the codon matrix, we remark that the l.h.s., respectively the r.h.s., is, up to $\mp 1/2$, the
sum of the two 4x4 matrices obtained by mirror inversion with respect to the center of the matrix,
that is the exchange $C \leftrightarrow A$, $G \leftrightarrow U$. ^5 So we make the following conjecture. The sum of codon
usage frequencies over a set of r codons ($4 \le r \le 16$), placed on the same side of the diagonals of the
matrix, plus the mirror set is described by a straight line in $y_{GC}$ with a very small slope coefficient a
($|a| < 0.1$) and a constant term b equal to r/32. In other words, the sum of codon usage frequencies
over a set of r codons, placed on the same side of the diagonals of the matrix, plus the numerical
factor $-r/16$, is equal, up to a factor r/32, to the opposite of the sum of the codon usage frequencies
over the set of r mirror codons. The mirror symmetry with respect to the secondary diagonal (resp.
principal diagonal) is equivalent, in terms of $U_{q \to 0}(sl_H(2) \oplus sl_V(2))$, to the change of the sign of the
third component of $J_H$ and $J_V$ (resp. of the sign of the third component of $J_V$). Denoting by
$P^i(XYZ)$ (resp. $P^i(\overline{XYZ})$) the codon usage frequency of the i-th codon XYZ (resp. of the mirror
codon $\overline{XYZ}$), we make the ansatz that the following equation holds

$\sum_{i \in I} \left( P^i(XYZ) + P^i(\overline{XYZ}) \right) = a \, y_{GC} + \frac{r}{32}$ (28)

where I is an r-dimensional set of codons placed on the same side of the principal diagonal or secondary
diagonal of the codon matrix. The mirror codon $\overline{XYZ}$ with respect to the secondary diagonal (resp. to
the principal diagonal) has opposite values of $J_{3,H}$ and $J_{3,V}$ of the XYZ codon (respectively, opposite
values of $J_{3,V}$). One can see from Table 3 that indeed the values of the axis intercept b are very close
to the conjectured values r/32 (4/32 = 0.125, 8/32 = 0.250, 12/32 = 0.375). So our conjecture seems
supported by the experimental data and suggests the existence of an averaged symmetry $C \leftrightarrow A$,
$G \leftrightarrow U$ or $C \leftrightarrow G$, $U \leftrightarrow A$ for the codon usage frequencies. ^6 Of course when 2r is equal to the
number of the codons in the considered set, the above conjecture is trivially the normalisation condition.

^5 The mirror inversion in the crystal basis formalism means change of the sign of $J_{3,H}$ and $J_{3,V}$.

^6 These symmetries have already appeared in the literature in a different context. Indeed Rumer [25], as
quoted elsewhere, remarked in the sixties that the first symmetry exchanges the 32 strong codons, which form
quartets or the quartet subset of sextets, with the remaining 32 weak codons. This symmetry is sometimes
referred to as the Rumer symmetry [26]. The second symmetry has been remarked by [27] as the symmetry which
transforms each of the octets into itself, an octet being a set of eight dinucleotides appearing or not
appearing as the first two nucleotides in the quartets.

An analysis over our sample of 109 eubacteria, randomly drawing 4, 8, 12 codons in the 16-dim
set of codons with a specified nucleotide in a fixed position, shows that the straight line
behaviour is surprisingly present already for 4 codons (see Table 3, where we present a few examples
which confirm the conjecture). In this table, N denotes the number of random drawings of n
different codons belonging to a given set I (e.g. in the first line of Table 3, $I = \{CNN'\}$,
$N, N' = A, C, G, U$). For each drawing lot of codons in the set I, we expect the sum
$\sum_{XYZ \in I} P(XYZ) + P(\overline{XYZ})$ to behave linearly in terms of the GC content $y_{GC}$, with slope a
and axis intercept b. We observe that the distribution of the coefficients a and b is peaked around mean
values $\bar{a}$ and $\bar{b}$ with standard deviations $\sigma_a$ and $\sigma_b$, which are reported in Table 3. Two
particular subsets have also been considered,
$I_p = \{UNN', CUN, CCC, CCU, CCA, CAC, CAU, CAA, AUN, ACU, AAU\}$ and
$I_s = \{CNN', GCN, GGC, GGG, GGU, GUC, GUG, GUU, UCN, UUC, UGC\}$, corresponding respectively to
codons above the principal and secondary diagonals of the codon matrix (26). We have summed the
experimental codon usage of these codons and of the n mirror codons with respect to the secondary
diagonal. As can be read from Tables 1 and 3, no really meaningful difference appears between n < 16
and n = 16 for the first six sets I of Table 3, which correspond to fixing C or G (hence A or U)
in first, second and third position respectively. Note that the number of possible
different configurations is 1820 for n = 4 or 12 and 12870 for n = 8. Surprisingly, this behaviour is
more evident for codons with a fixed nucleotide in third position, that is for codons encoding different
amino acids, strongly suggesting strong constraints between amino acids.

Finally, let us remark that such a behaviour no longer shows up if one chooses for I sets
which violate the diagonal symmetry (for example the first four lines or the first four columns of the
codon matrix (26), or the whole set of codons) and considers random drawings in the same conditions
as above.
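The drawing procedure described above can be sketched in a few lines. The example below is a hedged illustration, not the authors' code: it draws n codons from the set with C in first position, adds the frequencies of the drawn codons and of their mirror codons, and uses a uniform toy table, for which the sum is exactly n/32 whatever the drawing. With real data, one would repeat this for each species and regress the sums against the GC content. The last line counts the possible drawings with binomial coefficients.

```python
import math
import random
from itertools import product

MIRROR = str.maketrans("CAGU", "ACUG")  # C <-> A, G <-> U

def mirror(codon):
    """Mirror codon under the exchange C <-> A, G <-> U."""
    return codon.translate(MIRROR)

def drawn_sum(freqs, subset, n, rng):
    """Sum of P(XYZ) + P(mirror of XYZ) over n distinct codons drawn from subset."""
    drawn = rng.sample(subset, n)
    return sum(freqs[c] + freqs[mirror(c)] for c in drawn)

codons = ["".join(c) for c in product("ACGU", repeat=3)]
# Toy input: uniform usage over the 64 codons (not real data).
uniform = {c: 1.0 / 64 for c in codons}
first_C = [c for c in codons if c[0] == "C"]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sums = [drawn_sum(uniform, first_C, 4, rng) for _ in range(5)]
print(sums, 4 / 32)
print(math.comb(16, 4), math.comb(16, 8), math.comb(16, 12))
```

The number of distinct ways of choosing n codons out of 16 is the binomial coefficient, which bounds the number of genuinely different drawings for each n.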

So, we derive the conclusion that the sum of the codon usage frequencies over a suitable set of
codons is a linear function of GC, showing a conspiracy effect between the f(n) similar to the one of
eq. (24), that is

$\sum_{n \in Q} f(n) = a \, y_{GC} + b$ (29)

where now Q is any 2r-dimensional set of codons satisfying the above discussed characteristics.

The results above were based on sums of codons of the type $P(XYZ) + P(\overline{XYZ})$, the bar
meaning the mirror codon obtained by the mirror symmetry $C \leftrightarrow A$, $G \leftrightarrow U$. It is interesting to see
what happens if one considers instead sums of codons and their reverse complementary codons, i.e.
$P(XYZ) + P(\hat{Z}\hat{Y}\hat{X})$, where the hat means the complementary rule $C \leftrightarrow G$, $A \leftrightarrow U$ (in other
words, summing codons on one strand and complementary codons on the other strand). A preliminary
analysis shows that the behaviour (28) also holds, but the slope and axis intercept ranges are wider and
the mean slope is not close to zero. Such an analysis could be instructive because it might shed some
light on the DNA strand asymmetry [28, 29, 30]. However, in the present work, we analyse the
coding sequences extracted from the CUTG database, for which one does not know whether the
coding sequence comes from a gene on the leading strand or on the lagging strand, and the intergenic
(non coding) sequences were obviously not considered. Hence we are unable to comment on this point
for the moment.

6 Discussion and conclusion 

The distribution of the experimental codon probabilities for the total exonic region for a large sample
of bacteria is well fitted by the law (2). The spectrum of the distribution is universal, but the codon
which occupies a fixed level depends on the biological species. The universal form of the function
f(n) strongly suggests the presence of a bias of very general origin. Indeed, in [32], a mutation model
has been proposed where the intensity of the mutation matrix, essentially in one point mutation,
depends on the variation of the labels identifying the states of the irrep. of $U_{q \to 0}(sl(2) \oplus sl(2))$
describing, in the crystal basis [2^, the codons. The numerically computed stationary solution of
the master equation for the distribution of the 64 codons is nicely fitted by a function of the form of
eq. (2). The model depends on the choice of the form of the fitness function and on the values
of the arbitrary parameters appearing in the mutation matrix, but an unrealistic choice of these values
destroys the goodness of the fit. The remark of [23] on the mystery of the straight line in the codon
position-specific frequencies as a function of the total exonic GC content raises, in the light of the
previously discussed form of the rank ordered distribution, an even more intriguing question: which
is the mechanism which ensures that the sum of the frequencies f(n), given by eq. (2), over a set
of codons, whose rank distribution generally depends on the biological species, is a linear function
of $y_{GC}$? From this property, we say that there is a conspiracy between the different codons. These
effects are more evident when we sum the frequencies of the codons with fixed third nucleotide, which
implies constraints on the amino-acid distribution. We have also discussed in detail the structure of
the Shannon entropy. It is commonly stated in the literature that the non-coding part of DNA exhibits
more correlation than the coding part, which is in contradiction with what one would naively expect,
as the coding part is more subject to functional constraints. The results of our work point out the
existence of strong correlations in the exonic part, which very likely witness the existence of a functional
bias, worthy of further analysis, and which should be better interpreted and understood.

It has been known for some time that the plots of the first/second codon position GC content versus
the third codon position GC content (or versus the total GC content) show straight line correlations
[SI 111! 1^, [33]. The straight line behaviour of the codon position-specific frequencies remarked in
[22, 23] (see also Tables 1 and 2) of course implies these correlations, but is stronger. In [9], a
four-parameter model is developed to explain the 64 codons and the 20 amino-acids usage frequencies as a
function of the GC content. Considering the sums of the eight codons with the same GC content in position
(which correspond to the columns of our codon matrix (26)), a satisfactory agreement between experimental
values and theoretical curves is found. The authors conclude that the GC content primarily drives the
codon usage rather than the inverse. Our results concerning the GC dependence of the codon usage,
although established on different grounds, go in the same direction.

As a last remark, let us make some comments on genes. Although the GC content varies much
less inside a genome than across genomes for bacteria, it is instructive to see whether the straight line
effect also occurs at the gene level. In the case of E. coli K12 for example, for which the mean GC
content is 51.8%, most of the genes correspond to values of GC close to the mean value, but the tails
of the distributions $P^i_X$ (X = C, G, U, A and i = 1, 2, 3) show a trend towards straight lines. Surely, a
more detailed study is warranted. An analogous trend for Homo sapiens has been observed in [33].

Finally we have conjectured the existence of a quasi-symmetry with respect to the principal and
secondary diagonals of the codon matrix, written in the form suggested by the crystal basis model
of the genetic code [21]. Clearly these discrete symmetries can be formulated without any reference
to the crystal basis, but they appear naturally in this mathematical modelling. So, as a concluding
remark, let us point out that the crystal basis model seems to provide the kinematical variables useful to
describe some properties of the genetic code. As is well known, the use of appropriate variables in
mathematics, in physics, and, very likely, also in biology is an essential step for tackling a
problem effectively.
Table 1: Regression coefficients for $\sum P(X_i) = a \, y_{GC} + b$ (eubacteria).

Sum            a         b         R (Pearson)
Σ P(CNN')      0.424     0.004     0.949
Σ P(ANN')     -0.452     0.490     0.960
Σ P(GNN')      0.283     0.209     0.928
Σ P(UNN')     -0.255     0.297     0.955
Σ P(NCN')      0.239     0.108     0.925
Σ P(NAN')     -0.338     0.467     0.959
Σ P(NGN')      0.215     0.065     0.955
Σ P(NUN')     -0.117     0.358     0.898
Σ P(NN'C)      1.067    -0.257     0.984
Σ P(NN'A)     -0.928     0.676     0.985
Σ P(NN'G)      0.770    -0.130     0.984
Σ P(NN'U)     -0.909     0.711     0.977
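The entries of Table 1 can be cross-checked against the normalisation conditions (23): for each codon position, the four slopes should add up to about zero and the four intercepts to about one. The sketch below transcribes the slopes and intercepts of Table 1; note that the label of the fourth row is missing in the extracted table and is assumed here to be the remaining first-position nucleotide U.

```python
# (slope a, intercept b) per position and nucleotide, transcribed from Table 1.
# Assumption: the unlabelled fourth row is the first-position U row.
TABLE_1 = {
    1: {"C": (0.424, 0.004), "A": (-0.452, 0.490), "G": (0.283, 0.209), "U": (-0.255, 0.297)},
    2: {"C": (0.239, 0.108), "A": (-0.338, 0.467), "G": (0.215, 0.065), "U": (-0.117, 0.358)},
    3: {"C": (1.067, -0.257), "A": (-0.928, 0.676), "G": (0.770, -0.130), "U": (-0.909, 0.711)},
}

checks = {}
for position, rows in TABLE_1.items():
    slope_sum = sum(a for a, _ in rows.values())
    intercept_sum = sum(b for _, b in rows.values())
    # Mirror pair C + A: expected to have a small slope and an intercept near 1/2.
    pair_slope = rows["C"][0] + rows["A"][0]
    pair_intercept = rows["C"][1] + rows["A"][1]
    checks[position] = (slope_sum, intercept_sum, pair_slope, pair_intercept)
    print(position, round(slope_sum, 3), round(intercept_sum, 3),
          round(pair_slope, 3), round(pair_intercept, 3))
```

The same loop also prints the slope and intercept of the mirror pair C + A, which correspond to the n = 16 case of the conjecture (28).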



Table 2: Regression coefficients for $\sum P(X_i) = a \, y_{GC} + b$ (archaea).

Sum            a         b         R (Pearson)
Σ P(CNN')      0.414    -0.028     0.951
Σ P(ANN')     -0.557     0.569     0.963
Σ P(GNN')      0.393     0.173     0.917
Σ P(UNN')     -0.249     0.286     0.901
Σ P(NCN')      0.269     0.079     0.891
Σ P(NAN')     -0.285     0.440     0.841
Σ P(NGN')      0.190     0.082     0.859
Σ P(NUN')     -0.174     0.398     0.845
Σ P(NN'C)      1.047    -0.241     0.982
Σ P(NN'A)     -0.923     0.687     0.988
Σ P(NN'G)      0.687    -0.066     0.943
Σ P(NN'U)     -0.812     0.619     0.983
Table 3: Sum of frequencies for drawing lots of n codons belonging to some subset I of size N.

Sum                                            n     N        mean a     σ_a      mean b    σ_b
Σ_{N,N'} P(CNN') + P(mirror CNN')              4     1800     -0.007     0.107    0.123     0.055
idem                                           8     4000     -0.013     0.12     0.247     0.064
idem                                           12    1800     -0.02      0.107    0.370     0.055
Σ_{N,N'} P(NCN') + P(mirror NCN')              4     1800     -0.024     0.113    0.144     0.060
idem                                           8     4000     -0.048     0.13     0.287     0.07
idem                                           12    1800     -0.074     0.11     0.432     0.06
Σ_{N,N'} P(NN'C) + P(mirror NN'C)              4     1800      0.034     0.116    0.105     0.065
idem                                           8     4000      0.067     0.134    0.21      0.074
idem                                           12    1800      0.10      0.117    0.311     0.065
Σ_{N,N'} P(GNN') + P(mirror GNN')              4     1800      0.007     0.115    0.126     0.061
idem                                           8     4000      0.013     0.132    0.253     0.070
idem                                           12    1800      0.021     0.115    0.379     0.061
Σ_{N,N'} P(NGN') + P(mirror NGN')              4     1800      0.025     0.108    0.106     0.035
idem                                           8     4000      0.050     0.124    0.211     0.063
idem                                           12    1800      0.074     0.108    0.318     0.055
Σ_{N,N'} P(NN'G) + P(mirror NN'G)              4     1800     -0.035     0.103    0.145     0.049
idem                                           8     4000     -0.074     0.120    0.293     0.057
idem                                           12    1800     -0.104     0.103    0.436     0.049
Σ_{N,N',N''} P(NN'N'') + P(mirror NN'N'')      16    10000    -0.0024    0.21     0.50      0.112
Σ_{N,N',N''} P(NN'N'') + P(mirror NN'N'')      24    10000    -0.0042    0.245    0.75      0.128
Σ_{N,N',N''} P(NN'N'')                         32    10000     8 × 10^{-?}  0.30  0.50      0.147
Σ_{NN'N'' in I_s} P(NN'N'')                    16    10000     0.45      0.20     0.019     0.088
Σ_{NN'N'' in I_s} P(mirror NN'N'')             16    10000    -0.445     0.189    0.48      0.106
Σ_{NN'N'' in I_s} P(NN'N'') + P(mirror NN'N'') 16    10000     0.002     0.18     0.50      0.094
Σ_{NN'N'' in I_p} P(NN'N'')                    16    10000    -0.232     0.189    0.328     0.098
Σ_{NN'N'' in I_p} P(mirror NN'N'')             16    10000     0.229     0.226    0.173     0.113
Σ_{NN'N'' in I_p} P(NN'N'') + P(mirror NN'N'') 16    10000    -0.003     0.18     0.50      0.094